2020-08-01

Why geospatial analysis?

  • Assess spatial as well as temporal variation in disease prevalence
  • Identify risk factors, areas of high risk
  • Allocate resources efficiently

Research questions

  • Heterogeneous (patchy) prevalence of onchocerciasis
Onchocerciasis prevalence map

Research questions?

  • What are ecological drivers of onchocerciasis prevalence?
  • What are the ecological drivers of vector and parasite distribution?
  • Do factors affecting vector and parasite distribution also govern onchocerciasis prevalence?
  • Why is the central transition zone of Ghana apparently one large fly zone, while the vector population in Ethiopia is more fragmented?

Data

Source of data

  • Database of onchocerciasis prevalence data (Hill et al., 2019).
  • It consists of data from 31 countries including some South American countries.
  • 1900 observations from 1972 to 2015 (African countries)
    • 1669 have known GIS coordinates
    • 231 do not have GIS coordinates, but have reference to polygon shape files

Prevalence data

Onchocerciasis prevalence data by decade

Prevalence data based on diagnostic tests

Counts of different diagnostic tests

Diagnostic tests used in different decades

Building a geospatial modelling framework

Methods considered

  • Conventional approach
    • Generalized linear model (GLM)
    • Kriging
  • Machine learning approach: Random Forest models
  • Bayesian approach
    • Posterior probability distribution estimation
      • By Markov Chain Monte Carlo (MCMC) simulation
      • By Integrated Nested Laplace Approximation (INLA)

Data used

  • Ethiopian prevalence data
  • Why Ethiopia?
    • No geospatial studies on Eastern Africa
    • Mixed endemicity with geographical diversity

Covariates for prevalence prediction

  • Climate variables from worldclim.org
  • Altitude
  • Population density

Covariates for prevalence prediction

Raster layer for some of the predictors

Generalized linear model

Regression coefficients

Term                         Estimate         2.5 %        97.5 %
(Intercept)              -278.3242328  -455.1229757  -101.5254900
alt                         0.0169194    -0.0050694     0.0389083
isothermality               2.1396189     0.6583329     3.6209048
temperature.seasonality     0.0016484    -0.0198791     0.0231760
annual.precp                0.0212815     0.0083844     0.0341786
popden                     -0.2144603    -0.8037939     0.3748734
annual.mean.temp            0.3790577     0.0113506     0.7467647
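
One rough way to read the coefficient table is to check which 95% intervals exclude zero. The sketch below uses only the values printed in the table; the dictionary layout and the helper name excludes_zero are illustrative assumptions, not part of the original analysis.

```python
# Coefficient table from the GLM slide: term -> (estimate, 2.5 %, 97.5 %).
GLM_TABLE = {
    "(Intercept)": (-278.3242328, -455.1229757, -101.5254900),
    "alt": (0.0169194, -0.0050694, 0.0389083),
    "isothermality": (2.1396189, 0.6583329, 3.6209048),
    "temperature.seasonality": (0.0016484, -0.0198791, 0.0231760),
    "annual.precp": (0.0212815, 0.0083844, 0.0341786),
    "popden": (-0.2144603, -0.8037939, 0.3748734),
    "annual.mean.temp": (0.3790577, 0.0113506, 0.7467647),
}


def excludes_zero(lower, upper):
    """True when the whole interval lies on one side of zero."""
    return lower > 0 or upper < 0


# Terms whose interval does not contain zero, in table order.
flagged = [
    term
    for term, (_estimate, lower, upper) in GLM_TABLE.items()
    if excludes_zero(lower, upper)
]
print(flagged)
```

Under this reading, isothermality, annual precipitation and annual mean temperature (plus the intercept) are flagged, while altitude, temperature seasonality and population density are not.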

GLM Output

Predicted prevalence map by Generalized Linear Model

Kriging

  • Accounts for spatial autocorrelation using a variogram
    Sample variogram with different parameters
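
The sample variogram behind such plots can be sketched as follows. This is a minimal illustration that assumes planar coordinates and the classical (Matheron) estimator, half the mean squared difference between pairs of observations in each distance bin. The function name, bin width and toy data are hypothetical, not the settings used for the Ethiopian data.

```python
import math
from collections import defaultdict


def empirical_variogram(coords, values, bin_width, max_dist):
    """Return (bin midpoint, semivariance, pair count) for each distance bin."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    n = len(values)
    for i in range(n):
        for j in range(i + 1, n):
            h = math.dist(coords[i], coords[j])  # Euclidean distance
            if h >= max_dist:
                continue  # ignore pairs beyond the maximum lag
            b = int(h // bin_width)
            sums[b] += (values[i] - values[j]) ** 2
            counts[b] += 1
    return [
        ((b + 0.5) * bin_width, sums[b] / (2 * counts[b]), counts[b])
        for b in sorted(counts)
    ]


# Toy example: three sites on a line with hypothetical prevalence values.
sites = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
prevalence = [1.0, 2.0, 4.0]
variogram = empirical_variogram(sites, prevalence, bin_width=1.0, max_dist=3.0)
print(variogram)  # [(1.5, 1.25, 2), (2.5, 4.5, 1)]
```

For longitude/latitude data, the Euclidean distance would need to be replaced by a great-circle distance or the coordinates projected first.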

Kriging on Ethiopian data

Wave variogram fitted for Ethiopian prevalence data

Kriging output

Predicted prevalence map by Ordinary Kriging

Random Forest regression

  • Machine learning techniques are well known for handling multidimensional data
  • An ensemble of many decision (regression) trees whose predictions are averaged
    Decrease in model prediction error with increase in number of trees
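
The idea of averaging many trees can be sketched with bootstrap-aggregated regression stumps. This is a simplified illustration, not a full Random Forest: it uses a single hypothetical covariate, one split per tree and no random feature selection, and the data are synthetic.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree by minimising squared error."""
    best = None
    for t in sorted(set(xs))[1:]:  # thresholds that leave both sides non-empty
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        ml, mr = mean(left), mean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:  # bootstrap sample with a single distinct x value
        m = mean(ys)
        return (float("inf"), m, m)
    return best[1:]


def predict_stump(stump, x):
    t, ml, mr = stump
    return ml if x < t else mr


def fit_forest(xs, ys, n_trees, seed=0):
    """Fit each stump on a bootstrap resample of the data (bagging)."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    """Average the predictions of all trees."""
    return mean([predict_stump(s, x) for s in forest])


# Synthetic data: prevalence jumps from about 0.1 to about 0.4 at x = 10.
noise = random.Random(42)
xs = list(range(20))
ys = [(0.1 if x < 10 else 0.4) + noise.uniform(-0.05, 0.05) for x in xs]
forest = fit_forest(xs, ys, n_trees=25)
print(predict_forest(forest, 2), predict_forest(forest, 17))
```

Averaging over resampled trees is what stabilises the prediction error as the number of trees grows.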

Random forest: variable importance

Variable importance plot

Random Forest output

Predicted median prevalence map by Random Forest

Bayesian Approach

  • Realistic approach for estimation of model parameters
  • Allows prior knowledge about model parameters to be incorporated
    Bayesian equation for parameter estimation
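
In its standard form, with generic symbols that are assumed here rather than taken from the slide (\theta for the model parameters, y for the observed data, \pi(\theta) for the prior), Bayes' theorem reads:

```latex
\pi(\theta \mid y)
  = \frac{p(y \mid \theta)\,\pi(\theta)}
         {\int p(y \mid \theta')\,\pi(\theta')\,d\theta'}
  \;\propto\; p(y \mid \theta)\,\pi(\theta)
```

The posterior is proportional to the likelihood times the prior; the integral in the denominator is the normalising constant that MCMC avoids computing and INLA approximates.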

Markov Chain Monte Carlo

  • Traditional algorithm for sampling from the posterior distribution
  • Used by Hanlon et al. (2016) with the R package geoRglm
  • Computationally demanding; the model was run with fewer iterations, prediction locations and covariates
  • Priors
    • Normal distribution for regression coefficients
    • Uniform distribution for range parameter
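
The core of MCMC can be sketched with a random-walk Metropolis sampler. The example below is a deliberately small, non-spatial illustration: a single logit-scale intercept with a normal prior and a binomial likelihood. It is not the geoRglm model, and the counts (30 positives out of 100 examined), prior width and tuning values are hypothetical.

```python
import math
import random


def log_posterior(b, positives, examined, prior_sd):
    """Unnormalised log posterior for a logit-scale intercept b."""
    # log(p) for p = 1 / (1 + exp(-b)), written to avoid overflow.
    if b >= 0:
        log_p = -math.log1p(math.exp(-b))
    else:
        log_p = b - math.log1p(math.exp(b))
    log_q = log_p - b  # log(1 - p)
    log_lik = positives * log_p + (examined - positives) * log_q
    log_prior = -0.5 * (b / prior_sd) ** 2  # normal prior, constant dropped
    return log_lik + log_prior


def metropolis(positives, examined, prior_sd=10.0, n_iter=5000,
               burn_in=1000, step_sd=0.3, seed=1):
    """Return post-burn-in draws of prevalence p = 1 / (1 + exp(-b))."""
    rng = random.Random(seed)
    b = 0.0
    current = log_posterior(b, positives, examined, prior_sd)
    draws = []
    for it in range(n_iter):
        proposal = b + rng.gauss(0.0, step_sd)
        candidate = log_posterior(proposal, positives, examined, prior_sd)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            b, current = proposal, candidate
        if it >= burn_in:
            draws.append(1.0 / (1.0 + math.exp(-b)))
    return draws


draws = metropolis(positives=30, examined=100)
posterior_mean = sum(draws) / len(draws)
print(round(posterior_mean, 3))
```

Plotting the successive draws against iteration number gives a trace plot, which is the usual first check that the chain has settled.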

MCMC Output

Trace plots for the estimated parameters

Integrated Nested Laplace Approximation (INLA)

  • Faster than MCMC; allows larger datasets to be fitted
  • Approximates the continuous spatial field by a discrete field defined on a triangulated mesh
    Triangulated Mesh for Ethiopian prevalence data
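
INLA is built on Laplace approximations, which replace a posterior with a Gaussian centred at its mode. The sketch below shows only that basic step for a single logit-scale intercept with a normal prior and a binomial likelihood. It is not the INLA algorithm or the R-INLA interface, it involves no mesh or spatial field, and the counts and prior width are hypothetical.

```python
import math


def laplace_logit_intercept(positives, examined, prior_sd=10.0, tol=1e-10):
    """Gaussian approximation (mode, sd) to the posterior of a logit intercept."""
    b = 0.0
    for _ in range(50):  # Newton-Raphson search for the posterior mode
        p = 1.0 / (1.0 + math.exp(-b))
        grad = positives - examined * p - b / prior_sd ** 2
        hess = -examined * p * (1.0 - p) - 1.0 / prior_sd ** 2
        step = grad / hess
        b -= step
        if abs(step) < tol:
            break
    # Curvature at the mode gives the variance of the approximating Gaussian.
    p = 1.0 / (1.0 + math.exp(-b))
    hess = -examined * p * (1.0 - p) - 1.0 / prior_sd ** 2
    return b, math.sqrt(-1.0 / hess)


mode, sd = laplace_logit_intercept(positives=30, examined=100)
print(round(mode, 3), round(sd, 3))
```

No sampling is involved, which is why this family of methods is much faster than MCMC when the posterior is close to Gaussian.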

INLA Output

Regression coefficients

Parameter             Mean           SD  2.5% quant.       Median 97.5% quant.
b0             -10.7019100   27.5594618  -65.0088243  -10.6591201   43.3169943
altitude         0.0047719    0.0035015   -0.0018561    0.0046585    0.0120808
isothermality   -0.4135565    0.2971715   -1.0255053   -0.4054335    0.1500352
temp.season     -0.0079408    0.0059171   -0.0208241   -0.0075787    0.0027440
annual.precp     0.0134181    0.0043420    0.0053399    0.0132497    0.0225356
annual.temp      0.1110413    0.0695194   -0.0163252    0.1076159    0.2587491
popden          -0.2375209    0.0287466   -0.2967719   -0.2368306   -0.1822866

INLA Output

Predicted median prevalence map by INLA

Conclusion

  • Bayesian methods are robust but computationally demanding
  • Random Forest models offer no direct way to quantify the effects of covariates
  • Geospatial analysis is shifting towards the INLA approach

Next steps

  • Curating prevalence data
  • Running code on a cluster
  • Setting priors
  • Downloading and selecting covariates for the model
  • Model evaluation

Thank you

Any Questions?